202211.00420v1 


chinaXiv 


RESEARCH PAPER 


Fuzzy-Constrained Graph Pattern Matching in Medical 
Knowledge Graphs 


Lei Li†, Xun Du, Zan Zhang & Zhenchao Tao


1 Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), Hefei University of Technology,
Hefei 230601, China
2 Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology), Hefei 230601, China
3 School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
4 The First Affiliated Hospital of University of Science and Technology of China, Hefei 230031, China


Keywords: Graph pattern matching; Medical Knowledge Graphs; Fuzzy constraints; Breast cancer; Diagnostic classification


Citation: Li, L. et al.: Fuzzy-Constrained Graph Pattern Matching in Medical Knowledge Graphs. Data Intelligence 4(3), 599-619 
(2022). DOI: 10.1162/dint_a_00153 
Received: Oct. 10, 2021; Revised: Jan. 15, 2022; Accepted: Apr. 10, 2022 


ABSTRACT 


Research on graph pattern matching (GPM) has attracted a lot of attention. However, most of this
research has focused on complex networks, and there are few studies of GPM in the medical field.
Hence, this paper uses GPM to make a breast cancer-oriented diagnosis before surgery. Technically,
this paper first gives a new definition of GPM, aiming to explore GPM in the medical field,
especially in Medical Knowledge Graphs (MKGs). Then, for the specific matching process, this paper
introduces fuzzy calculation and proposes a multi-threaded bidirectional routing exploration (M-TBRE)
algorithm based on depth-first search, together with a two-way routing matching algorithm based on
multi-threading. In addition, fuzzy constraints are introduced into the M-TBRE algorithm, which leads
to the Fuzzy-M-TBRE algorithm. The experimental results on two datasets show that, compared with
existing algorithms, our proposed algorithm is more efficient and effective.


† Corresponding author: Lei Li (E-mail: lilei@hfut.edu.cn, ORCID: 0000-0002-5374-7293).


1. INTRODUCTION


As a basic data structure, graphs are widely used in many applications. For example, in object
anomaly checking, objects can be represented by graphs, and anomalies can then be discovered with
a suitable graph algorithm [1]. Similarly, to determine whether a user is interested in a certain
webpage, the webpage can be converted into multiple graphs; taking these graphs as a bag, the bag
can then be classified [2]. As a popular graph-based technology, graph pattern matching (GPM) has
attracted a lot of attention. Specifically, given a pattern graph, GPM is the task of finding subgraphs
of the data graph whose structure is similar or identical to that of the pattern graph. As the research
field of GPM has moved from the initial protein isomorphism [3, 4] to community detection [5, 6], expert
discovery [7], recommendation systems [8], the discovery of social groups [9-11] and the identification of
criminal groups [12], the definition of graph pattern has also changed accordingly.


Technically, GPM was originally defined based on subgraph isomorphism. Given a data graph $G_D$ and a
pattern graph $G_P$ as input, it returns whether $G_D$ contains a subgraph that has exactly the same
topological structure as $G_P$. For example, we can infer the properties of unknown proteins from the
properties of known proteins through this matching [3, 4]. However, traditional subgraph isomorphism
is too strict. To extend the application scenarios of GPM, Fan et al. [12] propose bounded simulation,
which extends edge-to-edge matching to matching an edge to a path of bounded length.
However, this matching still does not make use of the rich attribute information on vertices and edges.
Therefore, Liu et al. [13] propose the multi-constrained graph pattern matching (MC-GPM) problem to obtain
more effective matching results. Afterwards, considering that some attributes do not require exact
matching, Liu et al. [14] propose multiple fuzzy constrained graph pattern matching (MFC-GPM) based on
MC-GPM. However, the current application scenarios of GPM are mostly concentrated on complex
networks, and there are very few applications of GPM in the medical field, especially in Medical
Knowledge Graphs.


Nowadays, the incidence of breast cancer is rising, and patients are getting younger. Breast cancer
can be divided into ductal carcinoma in situ, lobular carcinoma in situ, invasive ductal carcinoma,
invasive lobular carcinoma, and so on. Each type of breast cancer can be further divided according to
the primary tumor staging, regional lymph node staging, and distant metastasis staging. The purpose of
this paper is to make a diagnosis through GPM technology before the patient's condition is diagnosed
through surgery.


In this paper, to introduce GPM into the medical field, we propose the problem of GPM in MKGs and
give relevant definitions. In addition, we propose the M-TBRE algorithm, which first divides the pattern
graph into pattern subgraphs, then obtains the matching results of the pattern subgraphs, and finally merges
these matching results. M-TBRE can give the diagnosis distribution of the pattern graph, and return the
best k diagnosis classification results according to the frequency of each diagnosis classification.
Fuzzy constraints are also introduced into the M-TBRE algorithm, which extends it to the Fuzzy-M-TBRE
algorithm, and the effectiveness of our algorithms is verified on two public datasets.
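The frequency-based selection of the best k diagnosis classifications can be illustrated with a minimal Python sketch. It assumes that each matched subgraph contributes one diagnosis label; the function name `top_k_diagnoses` and the label strings are hypothetical and are not taken from the paper's implementation.

```python
from collections import Counter

def top_k_diagnoses(matched_diagnoses, k):
    """Return the k most frequent diagnosis labels with their counts.

    matched_diagnoses holds one (hypothetical) label per matched subgraph.
    """
    return Counter(matched_diagnoses).most_common(k)

# Hypothetical labels collected from six matched subgraphs.
labels = [
    "IDC-T2-N0-M0", "IDC-T2-N0-M0", "IDC-T1-N0-M0",
    "DCIS-T0-N0-M0", "IDC-T2-N0-M0", "IDC-T1-N0-M0",
]
best_two = top_k_diagnoses(labels, 2)
```

Here `best_two` holds the two most frequent labels together with how many matched patients share each of them, which is one way to present a diagnosis distribution.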

The rest of this paper is organized as follows. The related work on GPM is reviewed in Section 2. Then
in Section 3, the concept of pattern matching in MKGs is introduced. Section 4 proposes a multi-threaded
bidirectional routing exploration algorithm and a Fuzzy-M-TBRE algorithm to process GPM in MKGs.
Section 5 introduces the datasets and conducts experiments to verify our proposed Fuzzy-M-TBRE algorithm,
and Section 6 concludes this paper.


2. RELATED WORK 


Depending on whether matching is judged by a bijective function or by a binary relation, research on
GPM can be divided into isomorphism-based GPM and simulation-based GPM.


2.1 Isomorphism-Based GPM 


Isomorphism-based GPM requires a bijective function between the pattern graph and the matched data
subgraph, so the topological structures of the two must be the same. Ullmann [15] first
proposes a matching algorithm based on depth-first search. Cordella et al. [16] improve Ullmann's algorithm
in terms of matching order and pruning strategy, and propose the VF2 algorithm. Tong et al. [17] propose
the G-Ray method, which uses a goodness function to measure the degree of matching between a
subgraph and the pattern graph, so that the optimal k subgraphs can be returned. In addition, Cheng et al.
[18] also propose a top-k matching algorithm, which sorts the matched subgraphs by their
number of spanning trees, thereby returning the optimal k subgraphs. Cheng et al. [19] propose the R-join
algorithm based on the join index of the clustering graph and optimize the algorithm. Other representative
algorithms include the DDST algorithm [20], the IncBMatch algorithm [21], and so on. Since isomorphism-based
GPM is an NP-complete problem, indexing [22, 23] and parallel or distributed computing [24-26] are
generally used to improve the efficiency of matching.


Isomorphism-based GPM is mostly used in fields with strict structural requirements, such as protein
isomorphism, 3D object matching [27] and network abnormal behavior detection [28]. However, such
matching is too strict for applications such as social networks or knowledge graphs, which do not require
strict structural matching. Therefore, simulation-based GPM research has emerged.


2.2 Simulation-Based GPM 


Graph simulation, which judges matching through binary relations, is introduced by Henzinger et al. [29],
but it is still an edge-to-edge matching, which cannot meet the requirements of many applications. Fan et
al. [12] extend graph simulation and propose bounded simulation, where an edge of the pattern graph can be
matched to a path whose length does not exceed a given constraint value k. Based on
bounded simulation, Ma et al. [30] propose strong simulation, which can well preserve the topological
structure of the pattern graph. There is a lot of attribute information on vertices and edges in big graph
data, but these existing works do not consider this information. Liu et al. [13] use this information
to extend bounded simulation to MC-GPM, and propose a baseline algorithm based on exploration and
a heuristic algorithm based on a data graph compression index (HAMC). Since the HAMC algorithm only
considers the constraint conditions of the matching path, does not consider minimizing the
matching path length, and does not support a distributed computing structure, Liu et
al. [31] propose the M-HAMC algorithm. Considering that the attribute values on vertices and edges
sometimes do not need to be exactly matched, Liu et al. [14] propose MFC-GPM on the basis of MC-GPM,
together with the ETOF-K algorithm, which improves matching efficiency in two respects: edge matching and
edge connection. Based on the topologically ordered sequence of pattern graph vertices, Liu et al. [32]
propose the NTSS algorithm and optimize it by introducing two measures: a caching mechanism
and reverse edge matching. The caching mechanism avoids repeated calculation of the same
candidate path in multiple matching subgraphs, and reverse edge matching prunes in advance the candidate
set of an edge whose in-degree is 0.


3. GRAPH PATTERN MATCHING 


GPM is to find all data subgraphs that satisfy the pattern graph $G_P$ in a given data graph $G_D$. In this
section, we give the relevant definitions of data graphs, pattern graphs, and graph pattern matching in
MKGs.


3.1 Data Graph and Pattern Graph 


The related definitions of the data graph and the pattern graph are as follows. 


3.1.1 Data Graph 


A data graph $G_D = (V, E, f_V^D, f_E^D)$ is a directed graph with vertex attributes and edge attributes, where

• $V$ is the set of vertices of the data graph;
• $E$ is the set of edges of the data graph, and $(v_i, v_j) \in E$ represents the directed edge from vertex
  $v_i \in V$ to vertex $v_j \in V$;
• $f_V^D$ is a function defined on $V$. For all $v \in V$, $f_V^D(v)$ is the attribute set of $v$. In an MKG,
  each vertex $v$ has a type label that represents the type of this vertex. When the label differs, the other
  attributes in the attribute set $f_V^D(v)$ also differ. The label can take the values DI, BI, MI, GW, OC,
  AL and PD;
• $f_E^D$ is a function defined on $E$. For all $(v_i, v_j) \in E$, $f_E^D(v_i, v_j)$ is the attribute set of
  the edge. In an MKG, for a directed edge $(v_i, v_j)$, $f_E^D(v_i, v_j)$ only contains $p^{pids}_{v_i,v_j}$,
  a list that stores patient numbers, that is, the identity of the patients from whom the vertex information
  comes.


DI: When the type label of vertex $v$ is DI, the attribute set $f_V^D(v)$ describes the diagnostic classification
information of breast cancer. It includes the pathological type, the T stage, the tumor length, the N stage
describing regional lymph nodes, and the M stage describing distant metastasis. The pathological type takes
the values 0, 1, 2 and 3, representing "invasive ductal carcinoma", "invasive lobular carcinoma", "ductal
carcinoma in situ" and "lobular carcinoma in situ", respectively; the T stage takes the values 0, 1, 2, 3
and 4; the tumor length is a floating-point number in cm; the N stage takes the values N0, N1, N2 and N3;
and the M stage takes the values M0 and M1.


BI: When the type label of vertex $v$ is BI, the attribute set $f_V^D(v)$ describes the basic information of the
patient, which includes three attributes. The first indicates whether the patient currently needs care, and its
value is true or false; the second indicates whether the patient is currently pregnant, and its value is true
or false; the third, $p^{age}$, indicates the current age of the patient, and its value is a positive integer.


MI: When the type label of vertex $v$ is MI, the attribute set $f_V^D(v)$ describes the patient's menopausal
information, and it only contains $p^{MS}$. The value of $p^{MS}$ can be 0, 1 and 2, indicating
pre-menopausal, perimenopausal and post-menopausal, respectively.


GW: When the type label of vertex $v$ is GW, the attribute set $f_V^D(v)$ describes the patient's current overall
well-being, and it only contains one attribute. Its value can be 0, 1, 2, 3 and 4, which respectively mean
"fully active, no complaints or symptoms", "doing normal activities requires a little effort", "occasionally
needs help, but can meet most personal needs", "needs a lot of assistance and frequent medical care", and
"completely disabled, can only lie in a bed or a chair".


OC: When the type label of vertex $v$ is OC, the attribute set $f_V^D(v)$ describes whether the patient has
cancers other than breast cancer. It includes a flag indicating whether the patient has other cancers and an
attribute holding their names. When the flag is false, the names attribute is none; when the flag is true, the
names attribute holds the names of the patient's other cancers.


AL: When the type label of vertex $v$ is AL, the attribute set $f_V^D(v)$ describes the patient's axillary lymph
nodes, and it includes five attributes. The first is true or false, indicating whether the patient's axillary
lymph nodes are normal. The second is true or false, indicating whether the patient's supraclavicular lymph
nodes are normal. The third is true or false, indicating whether the patient's subclavian lymph nodes are
normal. The fourth is true or false, indicating whether the patient's chest wall is normal. The fifth is a
positive integer giving how many of the patient's supraclavicular lymph nodes, subclavian lymph nodes and
chest wall have problems.


PD: When the type label of vertex $v$ is PD, the attribute set of vertex $v$ describes the patient's past
diagnosis information, and it includes eight attributes, each of which is true or false. They indicate,
respectively, whether the patient has AIDS, whether the patient has anemia, whether the patient has an
autoimmune disease, whether the patient has lung cancer, whether the patient has diabetes, whether the patient
has cardiovascular disease, whether the patient has osteoporosis, and whether the patient's reproductive
organs are diseased.
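To make the data graph definition concrete, the following Python sketch shows one possible in-memory representation. It is only an illustration under stated assumptions: the class names, the attribute keys (`age`, `pregnant`, `needs_care`, `menopausal_status`) and the sample values are hypothetical, and the paper's own implementation is not shown in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    vid: str      # hypothetical identifier such as "B2"
    label: str    # type label: DI, BI, MI, GW, OC, AL or PD
    attrs: dict = field(default_factory=dict)   # attribute set of the vertex

@dataclass
class DataGraph:
    vertices: dict = field(default_factory=dict)  # vid -> Vertex
    succ: dict = field(default_factory=dict)      # vid -> {target vid: set of patient numbers}
    pred: dict = field(default_factory=dict)      # vid -> {source vid: set of patient numbers}

    def add_vertex(self, vertex):
        self.vertices[vertex.vid] = vertex
        self.succ.setdefault(vertex.vid, {})
        self.pred.setdefault(vertex.vid, {})

    def add_edge(self, src, dst, pid):
        # The only edge attribute is the list of patient numbers (pids).
        self.succ[src].setdefault(dst, set()).add(pid)
        self.pred[dst].setdefault(src, set()).add(pid)

graph = DataGraph()
graph.add_vertex(Vertex("B2", "BI", {"age": 52, "pregnant": False, "needs_care": False}))
graph.add_vertex(Vertex("C1", "MI", {"menopausal_status": 2}))
graph.add_edge("B2", "C1", 2384)
graph.add_edge("B2", "C1", 676)
```

Storing both successor and predecessor maps is a design choice made here so that a search can walk edges in either direction; several patients sharing the same pair of vertices simply add their numbers to the same edge.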

3.1.2 Pattern Graph


A pattern graph $G_P = (V_P, E_P, f_V^P, f_E^P, f_l^P, f_m^P)$ is a directed graph with vertex attributes and
edge attributes, where:

• $V_P$ is the set of vertices of the pattern graph;
• $E_P$ is the set of edges of the pattern graph, and $(u_i, u_j) \in E_P$ represents the directed edge from
  vertex $u_i \in V_P$ to vertex $u_j \in V_P$;
• $f_V^P$ is a function defined on $V_P$, and for all $u \in V_P$, $f_V^P(u)$ is the attribute set of $u$. In an
  MKG, the attribute set $f_V^P(u)$ of vertex $u$ has the same meaning as the attribute set of a vertex in
  the data graph;
• $f_E^P$ is a function defined on $E_P$. For all $e \in E_P$, $f_E^P(e)$ is the attribute set of $e$. In an MKG,
  for a directed edge $(u_i, u_j)$, $f_E^P(u_i, u_j)$ only contains $p^{pids}_{u_i,u_j}$, a list that stores
  patient numbers, that is, the identity of the patients from whom the vertex information comes;
• $f_l^P$ is a function defined on $E_P$. For all $(u_i, u_j) \in E_P$, $f_l^P(u_i, u_j)$ is the length constraint
  of the edge $(u_i, u_j)$, and its value is a positive integer k or the symbol *, indicating respectively that
  the length of the path from $v_i$ to $v_j$ that matches this edge does not exceed k, or that there is no
  length limit. In an MKG, $f_l^P(u_i, u_j) = 1$;
• $f_m^P$ is a set of membership constraint functions defined on vertex attributes and edge attributes.


3.1.3 Fuzzy constraints 


During matching, it is desirable to obtain more and better matching results, because each matched
subgraph corresponds to a patient who has roughly the same health information as the patient to be
diagnosed in the pattern graph. The more matches are obtained, the more experience can be used for
reference in the treatment of the patient corresponding to the pattern graph. However, in practice,
a subgraph in the data graph may satisfy all other constraints well and still fail to become a matching
result because some less important attribute constraint on a vertex is not satisfied. In addition, some
attribute constraints on vertices do not need to be matched exactly; it is enough for the difference to
fall within a certain range. Therefore, we introduce fuzzy constraints to GPM in MKGs.

In the MKG, the set of membership functions $f_m^P = \{f_{age}\}$ introduces a fuzzy constraint on the age
attribute only. $f_{age}$ is the membership function defined on the vertex age attribute $p^{age}$, and its
constraint value is set to 3. The membership function $f_{age}$ is defined in Eq. (1), where abs is the
absolute value function, $p_u^{age}$ is the age attribute value of vertex $u$ in pattern graph $G_P$, and
$p_v^{age}$ is the age attribute value of vertex $v$ in data graph $G_D$. During matching, the age
attribute $p^{age}$ only needs to satisfy $f_{age} \le 3$.

$$f_{age} = \mathrm{abs}(p_v^{age} - p_u^{age}) \qquad (1)$$
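As a small worked illustration of Eq. (1), the Python sketch below evaluates the age membership function and checks it against the constraint value 3. The function names are hypothetical, and the check assumes that "does not exceed" means less than or equal to the constraint value.

```python
AGE_CONSTRAINT = 3  # membership constraint value used in the text

def age_membership(pattern_age, data_age):
    """Eq. (1): absolute difference between the two age values."""
    return abs(data_age - pattern_age)

def age_satisfied(pattern_age, data_age, bound=AGE_CONSTRAINT):
    """True when the age difference does not exceed the constraint value."""
    return age_membership(pattern_age, data_age) <= bound
```

For instance, a pattern vertex with age 50 accepts data vertices aged 47 to 53 under this reading, while an exact constraint would accept only age 50.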

3.2 Pattern Matching

The matched subgraph $G_{sub} = (V_{sub}, E_{sub}, f_V^{sub}, f_E^{sub})$ is a subgraph of the data graph $G_D$
that matches the pattern graph $G_P$, where $G_{sub} \subseteq G_D$, $V_{sub} \subseteq V$, $E_{sub} \subseteq E$,
$f_V^{sub} \subseteq f_V^D$ and $f_E^{sub} \subseteq f_E^D$. There may be more than one matched subgraph. The
definition of pattern matching in the MKG is as follows.


For a pattern graph $G_P = (V_P, E_P, f_V^P, f_E^P, f_l^P, f_m^P)$ and a data graph $G_D = (V, E, f_V^D, f_E^D)$,
$G_D$ matches $G_P$, denoted as $G_P \unlhd G_D$, if there is a binary relation $S \subseteq V_P \times V$ such that:

• for all $u \in V_P$, there is $v \in V$ such that $(u, v) \in S$, which means that there is a vertex $v$ in $V$
  that matches $u$, that is, $v$ satisfies $f_V^P(u)$. If the age attribute $p^{age}$ is included and
  $f_m^P = \{f_{age}\}$ is the membership function defined on the age attribute of $u$, then $p_v^{age}$ only
  needs to satisfy $f_{age} \le 3$. Except for the age attribute $p_u^{age}$, the values of the other attributes
  of $v$ must be equal to the values of the corresponding attributes of $u$ before it can be determined that
  $v$ matches $u$;
• for each pair $(u_i, v_i) \in S$,
  * $u_i \sim v_i$, and
  * for each edge $(u_i, u_j)$ in $E_P$, there is a path from $v_i$ to $v_j$ in $G_D$ such that $(u_j, v_j) \in S$.
    Because $f_l^P(u_i, u_j) = 1$, this path can be regarded as the edge from $v_i$ to $v_j$ in $G_D$.
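The vertex condition in this definition combines exact comparison with the fuzzy age constraint. A minimal Python sketch of such a check follows; the function name `vertex_matches`, the `fuzzy_bounds` parameter and the attribute keys are hypothetical, and attribute sets are assumed to be plain dictionaries.

```python
def vertex_matches(pattern_attrs, data_attrs, fuzzy_bounds=None):
    """Check whether a data vertex satisfies the attributes of a pattern vertex.

    Attributes listed in fuzzy_bounds may differ by at most the given bound;
    every other attribute must be equal.
    """
    fuzzy_bounds = fuzzy_bounds or {}
    for name, wanted in pattern_attrs.items():
        if name not in data_attrs:
            return False
        actual = data_attrs[name]
        if name in fuzzy_bounds:
            if abs(actual - wanted) > fuzzy_bounds[name]:
                return False
        elif actual != wanted:
            return False
    return True

# Hypothetical basic-information vertices of a pattern and a candidate.
pattern_b = {"age": 50, "pregnant": False, "needs_care": False}
candidate_b = {"age": 51, "pregnant": False, "needs_care": False}
exact_result = vertex_matches(pattern_b, candidate_b)
fuzzy_result = vertex_matches(pattern_b, candidate_b, {"age": 3})
```

With no fuzzy bounds the candidate is rejected because the ages differ by one year; with the bound 3 on age it is accepted, which is how the fuzzy constraint yields additional matches.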


Example 1: As shown in Figure 1, $G_D$ is a data graph composed of related information of multiple breast
cancer patients. Some vertices in the data graph store the diagnostic classification information of breast
cancer, and each vertex represents some information of a patient. The function $f_E^D(A_4, B_4)$ defined on
the directed edge $(A_4, B_4)$ in $G_D$ only contains the attribute $p^{pids}_{A_4,B_4}$. For example, the value
of $p^{pids}_{A_4,B_4}$ is 1375, which means that the relevant information on vertex $B_4$ comes from the breast
cancer patient numbered 1375. The pattern graph $G_P$ is the health status of a patient to be diagnosed. The
vertices B, C, D, E, F and G respectively represent the patient's basic information, menopausal status,
general well-being, information on cancers other than breast cancer, axillary lymph nodes and information
about past diagnoses. Vertex A is the diagnostic information of this patient, which is unknown and needs to
be obtained through GPM. Since all vertex information in the pattern graph comes from the same patient, we
need to find a patient number to serve as the attribute constraint on the edges in order to get the matching
result of the pattern graph.

Figure 1. Data graph and pattern graph in an MKG


Example 2: As shown in Figure 1, it is easy to find a subgraph $M_{sub1}$ of data graph $G_D$ that matches
pattern graph $G_P$. $M_{sub1}$ passes through vertices $A_2$, $B_2$, $C_1$, $D_2$, $E_1$, $F_2$ and $G_3$. Vertex
$A_2$ is the breast cancer diagnosis result for the pattern graph $G_P$. The attribute constraint value on the
edges in $M_{sub1}$ is 2384, which means that the health status of the patient numbered 2384 is close to that
of the patient corresponding to $G_P$.

After introducing fuzzy constraints, $f_{age} = \mathrm{abs}(p_B^{age} - p_{B_3}^{age}) = 1$ does not exceed the
membership constraint value 3 on the age attribute. In addition, the other attributes of $B_3$ are equal to
those of B, so vertex $B_3$ matches vertex B. We can therefore get a new matched subgraph $M_{sub2}$ that
passes through vertices $A_2$, $B_3$, $C_1$, $D_2$, $E_1$, $F_2$ and $G_3$. The attribute constraint value on the
edges in $M_{sub2}$ is 676.


4. GRAPH PATTERN MATCHING IN MEDICAL KNOWLEDGE GRAPHS 


In this section, we propose a multi-threaded bidirectional routing exploration algorithm M-TBRE to solve 
the GPM problem in MKGs. 


4.1 Algorithm Description 


Multi-core CPUs make it possible to process tasks in parallel and thus speed up the execution
of programs. Since the multi-constrained GPM problem is NP-complete, we adopt multi-threading to speed up
matching and return the matched results quickly. The matching process follows the idea of divide and
conquer. A pattern graph $G_P$ can be divided into several pattern subgraphs. After each pattern subgraph
has been matched in the data graph $G_D$, the matched results of the pattern subgraphs can be connected to
obtain the matched results of the pattern graph $G_P$. The matching of the pattern subgraphs can be delivered
as subtasks to multiple threads that complete them independently, so that matching results can be obtained
quickly.


4.2 Algorithm Flow 


In the M-TBRE algorithm, since the pattern graph of the MKG can be regarded as a path, we can split
the pattern graph at the intermediate vertex of this path and obtain two pattern subgraphs. The two
pattern subgraphs are then matched separately, and their matched results are connected to obtain the
matched results of the pattern graph.
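The split of a path-shaped pattern graph can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the pattern is assumed to be given as an ordered list of vertex names, and taking the element at index `len(path) // 2` as the intermediate vertex is an assumption (the text does not state how ties are broken for paths of even length).

```python
def split_path_pattern(path):
    """Split an ordered path of pattern vertices at its intermediate vertex.

    Returns (middle vertex, vertices before it, vertices after it).
    """
    if len(path) < 3:
        raise ValueError("a path needs at least three vertices to be split")
    mid = len(path) // 2
    return path[mid], path[:mid], path[mid + 1:]

middle, before, after = split_path_pattern(["A", "B", "C", "D", "E", "F", "G"])
```

For the seven-vertex pattern of Figure 1 this yields D as the intermediate vertex, with A, B, C on one side and E, F, G on the other, so that each side can be explored independently.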


The detailed steps of the M-TBRE algorithm are shown in Algorithm 1. First, the intermediate vertex
$v_P^{mid}$ of the pattern graph $G_P$ and the candidate vertex set $cand_{mid}$ of $v_P^{mid}$ are obtained, as
shown in lines 1-2. In line 3, pool and tempInfo represent the thread pool and the temporary result of the
matching, respectively. The number of threads in pool can be set according to the actual situation. Then the
pattern graph $G_P$ is divided into two sub-pattern graphs $G_P^{sub1}$ and $G_P^{sub2}$ with the intermediate
vertex $v_P^{mid}$ as the dividing point, and the two sub-pattern graphs are matched in the data graph $G_D$
by traversing the candidate vertex set $cand_{mid}$ of $v_P^{mid}$, as shown in lines 4-27. For each element
$cand_{mid}[i]$ in $cand_{mid}$, we intersect the attribute constraint $p^{pids}_{pre}$ on each forward (incoming)
edge $e_D^{pre}$ of $cand_{mid}[i]$ with the attribute constraint $p^{pids}_{aft}$ on each successor (outgoing)
edge $e_D^{aft}$ to obtain $p^{pids}$, which stores the patient numbers common to the current forward edge
$e_D^{pre}$ and the current successor edge $e_D^{aft}$, as shown in lines 6-25. Each patient number $p^{pid}$ in
$p^{pids}$ is taken as the attribute constraint on the edges to complete the matching of $G_P^{sub1}$ and
$G_P^{sub2}$, as shown in lines 14-21. tempInfo stores the partial matched result with $p^{pid}$ as the edge
attribute constraint, as shown in lines 15-16. The thread pool submits the subtasks MC-SEM and MC-FEM to
complete the matching of the two sub-pattern graphs, as shown in lines 19-20. The algorithm RM merges the
matched results, as shown in line 28.

The MC-SEM algorithm matches the sub-pattern graph that is reached from the intermediate vertex along
successor edges. Its inputs $v_P^{curr}$, $v_D^{curr}$, $p^{pid}$ and tempInfo respectively represent the pattern
vertex to be matched, the candidate vertex of the pattern vertex to be matched, the attribute constraint
value (patient number) of the edge, and the temporary result of the matching. If vertex $v_D^{curr}$ matches
vertex $v_P^{curr}$ but $v_P^{curr}$ does not have a successor edge, that is, the out-degree of $v_P^{curr}$ is 0,
the matching of the sub-pattern graph is completed, and the matched result with $p^{pid}$ as the attribute
constraint on the edges is saved in tempInfo, as shown in lines 2-7 of Algorithm 2. If vertex $v_D^{curr}$
matches vertex $v_P^{curr}$ and $v_P^{curr}$ has a successor edge, the successor edges $e_D^{aft}$ of $v_D^{curr}$
are traversed. When the attribute constraint $p^{pids}_{aft}$ on $e_D^{aft}$ includes $p^{pid}$, the matching of
the pattern vertex $e_P^{aft}$.tailNode is completed recursively, as shown in lines 8-16.


Algorithm 1 M-TBRE Algorithm
Input: $G_P$, $G_D$
Output: result set $M_{sub}$
 1: Get the intermediate node $v_P^{mid}$ of the pattern graph $G_P$
 2: Get the candidate node set $cand_{mid}$ of $v_P^{mid}$ from $V$
 3: initialize: pool, tempInfo
 4: i = 0
 5: while i < $cand_{mid}$.length do
 6:     $v_D^{mid}$ = $cand_{mid}$[i]
 7:     $e_D^{pre}$ = $v_D^{mid}$.firstPreEdge
 8:     while $e_D^{pre}$ ≠ NULL do
 9:         $p^{pids}_{pre}$ = $e_D^{pre}$.pids
10:         $e_D^{aft}$ = $v_D^{mid}$.firstAfterEdge
11:         while $e_D^{aft}$ ≠ NULL do
12:             $p^{pids}_{aft}$ = $e_D^{aft}$.pids
13:             $p^{pids}$ = $p^{pids}_{pre}$ ∩ $p^{pids}_{aft}$
14:             for $p^{pid}$ ∈ $p^{pids}$ do
15:                 initialize: state[3]
16:                 tempInfo.put($p^{pid}$, state)
17:                 $e_P^{pre}$ = $v_P^{mid}$.firstPreEdge
18:                 $e_P^{aft}$ = $v_P^{mid}$.firstAfterEdge
19:                 pool.submit(MC-SEM, $e_P^{aft}$.tailNode, $e_D^{aft}$.tailNode, $p^{pid}$, tempInfo)
20:                 pool.submit(MC-FEM, $e_P^{pre}$.tailNode, $e_D^{pre}$.tailNode, $p^{pid}$, tempInfo)
21:             end for
22:             $e_D^{aft}$ = $e_D^{aft}$.nextEdge
23:         end while
24:         $e_D^{pre}$ = $e_D^{pre}$.nextEdge
25:     end while
26:     i++
27: end while
28: $M_{sub}$ = RM(tempInfo)
29: return $M_{sub}$
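The core of Algorithm 1, intersecting the patient numbers on an incoming and an outgoing edge and submitting two subtasks per common patient number, can be sketched in Python with the standard `concurrent.futures` thread pool. This is a simplified illustration, not the authors' Java implementation: the function names are hypothetical, the two matching routines are passed in as callables, and each task is assumed to write only its own slot of the per-patient state list.

```python
from concurrent.futures import ThreadPoolExecutor

def common_patients(pre_pids, aft_pids):
    """Patient numbers shared by an incoming and an outgoing edge (line 13)."""
    return set(pre_pids) & set(aft_pids)

def explore_both_sides(pids, match_successors, match_predecessors, workers=4):
    """Submit one successor-side and one predecessor-side task per patient number."""
    # state[0]: predecessor side matched, state[1]: successor side matched,
    # state[2]: attributes of the diagnosis vertex found on the predecessor side.
    temp_info = {pid: [0, 0, None] for pid in pids}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        for pid in sorted(pids):
            futures.append(pool.submit(match_successors, pid, temp_info))
            futures.append(pool.submit(match_predecessors, pid, temp_info))
        for future in futures:
            future.result()  # re-raise any exception from a worker thread
    return temp_info

# Toy callables standing in for MC-SEM and MC-FEM.
def demo_successors(pid, temp_info):
    temp_info[pid][1] = 1

def demo_predecessors(pid, temp_info):
    temp_info[pid][0] = 1
    temp_info[pid][2] = {"diagnosis": "demo"}

shared = common_patients([676, 1375, 2384], [2384, 676])
info = explore_both_sides(shared, demo_successors, demo_predecessors, workers=2)
```

Because the two tasks for one patient number touch different slots of the same state list, no lock is used in this sketch; a real implementation would need to confirm that its shared structure is safe for concurrent writes.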

Algorithm 2 Multi-Constrained Subsequent Edge Matching, MC-SEM
Input: $v_P^{curr}$, $v_D^{curr}$, $p^{pid}$, tempInfo
 1: if $v_D^{curr}$.equals($v_P^{curr}$) then
 2:     $e_P^{aft}$ = $v_P^{curr}$.firstAfterEdge
 3:     if $e_P^{aft}$ == NULL then
 4:         state[] = tempInfo.get($p^{pid}$)
 5:         state[1] = 1
 6:         return
 7:     end if
 8:     $e_D^{aft}$ = $v_D^{curr}$.firstAfterEdge
 9:     while $e_D^{aft}$ ≠ NULL do
10:         $p^{pids}_{aft}$ = $e_D^{aft}$.pids
11:         if $p^{pid}$ ∈ $p^{pids}_{aft}$ then
12:             MC-SEM($e_P^{aft}$.tailNode, $e_D^{aft}$.tailNode, $p^{pid}$, tempInfo)
13:             return
14:         end if
15:         $e_D^{aft}$ = $e_D^{aft}$.nextEdge
16:     end while
17: end if
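The recursion of Algorithm 2 can be sketched in Python on dictionary-based graphs. The sketch is a simplification under stated assumptions: it returns a Boolean instead of writing a flag into tempInfo, the pattern is a path stored as a `vertex -> next vertex` map, the data graph is a `vertex -> {successor: set of patient numbers}` map, and vertex comparison is delegated to a hypothetical `same_vertex` callable.

```python
def match_successors(pattern_next, data_succ, same_vertex, p_vertex, d_vertex, pid):
    """Follow the pattern path from p_vertex through data edges that carry pid."""
    if not same_vertex(p_vertex, d_vertex):
        return False
    p_next = pattern_next.get(p_vertex)
    if p_next is None:
        # The pattern vertex has no successor edge: this side is fully matched.
        return True
    for d_next, pids in data_succ.get(d_vertex, {}).items():
        if pid in pids:
            # As in lines 11-13, continue along the first edge that carries pid.
            return match_successors(pattern_next, data_succ, same_vertex,
                                    p_next, d_next, pid)
    return False

# Toy graphs: pattern path E -> F -> G and a few data edges with patient numbers.
pattern_next = {"E": "F", "F": "G"}
data_succ = {
    "E1": {"F1": {1375}, "F2": {676, 2384}},
    "F2": {"G3": {676, 2384}},
}

def same_type(p_vertex, d_vertex):
    # Toy comparison: only the leading type letter is compared.
    return d_vertex[0] == p_vertex

found_2384 = match_successors(pattern_next, data_succ, same_type, "E", "E1", 2384)
found_1375 = match_successors(pattern_next, data_succ, same_type, "E", "E1", 1375)
```

In this toy data the path for patient 2384 reaches $G_3$, while the path for patient 1375 stops at $F_1$ because no outgoing edge carries that number.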


The MC-FEM algorithm matches the other sub-pattern graph, which is reached from the intermediate vertex
along forward (incoming) edges. Its processing is similar to that of the MC-SEM algorithm, except that MC-FEM
completes the matching according to a reverse depth-first search strategy.


Algorithm 3 Multi-Constrained Forward Edge Matching, MC-FEM
Input: $v_P^{curr}$, $v_D^{curr}$, $p^{pid}$, tempInfo
 1: $e_P^{pre}$ = $v_P^{curr}$.firstPreEdge
 2: if $e_P^{pre}$ == NULL then
 3:     state[] = tempInfo.get($p^{pid}$)
 4:     state[0] = 1
 5:     Save the attributes of $v_D^{curr}$ in state[2]
 6:     return
 7: end if
 8: if $v_D^{curr}$.equals($v_P^{curr}$) then
 9:     $e_D^{pre}$ = $v_D^{curr}$.firstPreEdge
10:     while $e_D^{pre}$ ≠ NULL do
11:         $p^{pids}_{pre}$ = $e_D^{pre}$.pids
12:         if $p^{pid}$ ∈ $p^{pids}_{pre}$ then
13:             MC-FEM($e_P^{pre}$.tailNode, $e_D^{pre}$.tailNode, $p^{pid}$, tempInfo)
14:             return
15:         end if
16:         $e_D^{pre}$ = $e_D^{pre}$.nextEdge
17:     end while
18: end if

The RM algorithm connects the matching results of the pattern subgraphs $G_P^{sub1}$ and $G_P^{sub2}$. When a
given patient number is used as the attribute constraint on all edges, and the flag bits representing the
matching results of $G_P^{sub1}$ and $G_P^{sub2}$ are both 1, the combination of the two matching results is a
matching result of the pattern graph $G_P$, as shown in lines 4-6 of Algorithm 4.


Algorithm 4 Result Merge, RM
Input: tempInfo
Output: result set $M_{res}$
 1: for entry ∈ tempInfo do
 2:     $p^{pid}$ = entry.key
 3:     state = entry.value
 4:     if state[0] == 1 and state[1] == 1 then
 5:         state[2].add($p^{pid}$)
 6:         $M_{res}$.add(state[2])
 7:     end if
 8: end for
 9: return $M_{res}$
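The merge step of Algorithm 4 can be sketched in Python as a filter over the per-patient state. The sketch assumes that tempInfo is a dictionary from patient number to a three-element list `[flag, flag, diagnosis attributes]`; the result records and the sample diagnosis values are hypothetical.

```python
def merge_results(temp_info):
    """Keep the patient numbers for which both sub-pattern graphs matched."""
    results = []
    for pid, state in temp_info.items():
        first_ok, second_ok, diagnosis = state
        if first_ok == 1 and second_ok == 1:
            # Attach the patient number to the stored diagnosis attributes.
            results.append({"pid": pid, "diagnosis": diagnosis})
    return results

# Hypothetical state after both matching subtasks have finished.
temp_info = {
    2384: [1, 1, {"T": 2, "N": "N0", "M": "M0"}],
    676: [1, 1, {"T": 2, "N": "N0", "M": "M0"}],
    1375: [1, 0, None],
}
merged = merge_results(temp_info)
```

Patient 1375 is dropped because only one of its two flags is set, so only the patients whose records match the whole pattern path remain.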


Example 3: In this example, $G_P$ and $G_D$ in Figure 1 are the pattern graph and the data graph, respectively.
First, we obtain the intermediate vertex D of pattern graph $G_P$ and the candidate vertex set
$cand_{mid} = \{D_2\}$ of D. The pattern graph $G_P$ is divided into pattern subgraph $G_P^{sub1}$, which passes
through vertices A, B and C, and pattern subgraph $G_P^{sub2}$, which passes through vertices E, F and G. The
forward edge $(C_1, D_2)$ and the subsequent edge $(D_2, E_1)$ of $D_2$ have the same attribute constraint
$p^{pids} = \{676, 2384\}$. Taking the matching of $G_P^{sub1}$ as an example, the attribute constraint
$p^{pids}_{C_1,D_2} = \{676, 2384\}$ on edge $(C_1, D_2)$ contains $p^{pid} = 2384$, and $C_1$ matches C at the same
time, so edge $(C_1, D_2)$ in $G_D$ matches edge $(C, D)$ in $G_P$. In the same way, $(B_2, C_1)$ in $G_D$ matches
$(B, C)$ in $G_P$, and $(A_2, B_2)$ in $G_D$ matches $(A, B)$ in $G_P$. The matching of $G_P^{sub1}$ takes
$p^{pid} = 2384$ as the attribute constraint, and the attribute information of vertex $A_2$ is the diagnosis
classification result of $G_P^{sub1}$. Similarly, for $G_P^{sub2}$, when $p^{pid} = 2384$ is the attribute
constraint on the edges in $E_P$, $(D_2, E_1)$ matches $(D, E)$, $(E_1, F_2)$ matches $(E, F)$, and $(F_2, G_3)$
matches $(F, G)$. Both $G_P^{sub1}$ and $G_P^{sub2}$ have matched results, so the diagnostic classification
information of $G_P$ is the diagnostic classification information of $G_P^{sub1}$, and the patient number in
$G_P$ is 2384. Finally, two matching subgraphs are obtained through the M-TBRE algorithm:
$M_{sub1} = \{V_M, E_M, f_V, f_E\}$, where $V_M = \{A_2, B_2, C_1, D_2, E_1, F_2, G_3\}$, $f_E = \{2384\}$ and
$E_M = \{(A_2, B_2), (B_2, C_1), (C_1, D_2), (D_2, E_1), (E_1, F_2), (F_2, G_3)\}$; and
$M_{sub2} = \{V_M, E_M, f_V, f_E\}$, where $V_M = \{A_2, B_3, C_1, D_2, E_1, F_2, G_3\}$, $f_E = \{676\}$ and
$E_M = \{(A_2, B_3), (B_3, C_1), (C_1, D_2), (D_2, E_1), (E_1, F_2), (F_2, G_3)\}$. The patients numbered 676
and 2384 can be used for reference when treating the patient corresponding to the pattern graph $G_P$.


5. EXPERIMENTS 


In this section, we conduct experiments on two public MKGs. The details of these two datasets are shown
in Table 1. We implement the proposed M-TBRE algorithm to complete pattern matching in the MKG.
Since the M-TBRE algorithm divides the pattern graph into two sub-pattern graphs, for each edge
attribute constraint value $p^{pid}$, the matching of these two sub-pattern graphs is delivered to the thread
pool as subtasks for execution. The number of threads in the thread pool affects the execution of the
algorithm, so we set different numbers of threads to measure how the efficiency of the M-TBRE algorithm
changes. In addition, to obtain more matched results, we introduce the fuzzy constraint,
which is a membership function for the age attribute constraint on the vertices. During matching, the age
attribute on a data graph vertex does not need to be the same as the age attribute constraint in the pattern
graph; the result of the membership function only needs to satisfy the corresponding membership
constraint value. Combining this constraint with the M-TBRE algorithm gives the Fuzzy-M-TBRE
algorithm. The Fuzzy-M-TBRE and M-TBRE algorithms are compared to verify the effectiveness
of the introduced fuzzy constraints.


Table 1. Details of the two datasets.


Dataset                             Vertices    Edges     Description
Female-breast-cancer-2013a          10812       20366     A graph about breast cancer patients
Breastcancer-femalepatient-2016A    101221      200845    A graph about breast cancer patients


5.1 Experimental Settings and Implementation 


The MKGs used in the experiments are about breast cancer. Dataset-1 and dataset-2 denote the dataset
Female-breast-cancer-2013a and the dataset Breastcancer-femalepatient-2016A, respectively. Dataset-1
is composed of the physical condition information of 10,000 breast cancer patients, and dataset-2 of
100,000 breast cancer patients. Several pattern graphs are used in our experiments, all of which are
similar to the pattern graph shown in Figure 1. Our membership function applies only to the age attribute
of the vertex, and the membership constraint value is set to 3. Both M-TBRE and Fuzzy-M-TBRE are
implemented in Java and run on a PC with an Intel(R) Core(TM) i9-10900F CPU @ 2.81 GHz, 32 GB RAM and
the Windows 10 operating system.


5.2 Experimental Results and Analysis 
5.2.1 Experiments on Execution Time 


This experiment studies how the execution time needed to complete the GPM changes when different numbers
of threads are set for the thread pool used in the M-TBRE algorithm. To reduce measurement error, each
reported result is the arithmetic mean of 10 runs.
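
The averaging procedure can be sketched as a small timing harness. This is only an assumed shape for such a harness: the class TimingSketch and its methods are hypothetical, the workload in main is a placeholder rather than a matcher, and this section does not state whether warm-up runs were discarded.

```java
// Illustrative sketch only: names and the placeholder workload are assumptions.
final class TimingSketch {

    // Arithmetic mean of the given samples.
    static double mean(double[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double sum = 0.0;
        for (double s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Runs the task `runs` times and returns the mean wall-clock time in seconds.
    static double meanSeconds(Runnable task, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return mean(seconds);
    }

    public static void main(String[] args) {
        // Placeholder workload; a real measurement would invoke the matcher here.
        Runnable workload = () -> {
            long x = 0;
            for (int i = 0; i < 1_000_000; i++) {
                x += i;
            }
        };
        System.out.printf("mean over 10 runs: %.6f s%n", meanSeconds(workload, 10));
    }
}
```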


In Figure 2 and Figure 3, the abscissa represents different pattern graphs, and the ordinate
represents the matching time of these pattern graphs. M-TBRE-1 denotes the M-TBRE algorithm with the
number of threads in the thread pool set to 1. Since the pattern graph in the MKG is a path, the edge join
strategy proposed in the ETOF-K algorithm does not take effect in the matching, so the performance of the
ETOF-K algorithm and the M-TBRE-1 algorithm is almost the same on dataset-1 and dataset-2. The reverse
matching strategy of the NTSS algorithm is also ineffective in the matching process, but its caching
mechanism avoids recomputing the same path, so the NTSS algorithm is faster than the ETOF-K algorithm
and the M-TBRE-1 algorithm on dataset-1 and dataset-2.


Figure 2. Matching time of different pattern graphs on Female-breast-cancer-2013a.


Figure 3. Matching time of different pattern graphs on Breastcancer-femalepatient-2016A.


However, our M-TBRE-1 algorithm can be extended to multithreaded variants, namely the M-TBRE-2, M-TBRE-4
and M-TBRE-8 algorithms, in which the number of threads in the thread pool is set to 2, 4, and 8,
respectively. As can be seen from Figure 2 and Figure 3, the M-TBRE-2 algorithm already outperforms the
NTSS algorithm, which supports the effectiveness of our proposed M-TBRE algorithm. In addition, Table 2
and Table 3 show the detailed execution time in seconds. Table 4 compares the average execution time of
the four M-TBRE variants on the two datasets. On both datasets, the execution time of the algorithm
decreases steadily as the number of threads increases.


Table 2. Execution time (in seconds) on the Female-breast-cancer-2013a dataset.


           P1      P2      P3      P4      P5      P6      P7      P8
ETOF-K     0.1625  0.1933  0.1742  0.1659  0.1743  0.1601  0.1742  0.1633
NTSS       0.1398  0.1472  0.1109  0.1365  0.1247  0.0973  0.1148  0.1245
M-TBRE-1   0.1832  0.1869  0.1516  0.1807  0.1654  0.1560  0.1535  0.1700
M-TBRE-2   0.1161  0.1097  0.0905  0.1059  0.0890  0.0785  0.0789  0.0979
M-TBRE-4   0.0869  0.0709  0.0488  0.0595  0.0550  0.0459  0.0448  0.0512
M-TBRE-8   0.0750  0.0488  0.0430  0.0373  0.0366  0.0311  0.0294  0.0331


Table 3. Execution time (in seconds) on the Breastcancer-femalepatient-2016A dataset.


           P1       P2       P3       P4       P5       P6       P7       P8
ETOF-K     25.1406  26.1477  29.3101  28.1559  26.7859  27.4765  28.6518  27.4800
NTSS       20.7231  21.6519  19.9583  20.4127  19.8864  22.7456  19.7193  21.5986
M-TBRE-1   27.2807  27.6337  28.7748  27.8562  27.7634  27.1356  26.8715  27.3600
M-TBRE-2   16.2298  16.3104  16.1557  16.3067  16.6475  16.5609  16.4358  16.7567
M-TBRE-4   9.4020   9.7056   9.8746   9.9239   9.7943   9.0300   8.9316   8.9685
M-TBRE-8   5.8198   5.9732   6.0509   6.0256   5.9963   6.1483   6.1579   6.1895


Table 4. Comparison of average execution time (in seconds) on the two datasets.


Dataset                            M-TBRE-1  M-TBRE-2  M-TBRE-4  M-TBRE-8  Reduction
Female-breast-cancer-2013a         0.1684    0.0958    -         -         43.11%
Female-breast-cancer-2013a         -         0.0958    0.0579    -         39.56%
Female-breast-cancer-2013a         -         -         0.0579    0.0418    27.81%
Breastcancer-femalepatient-2016A   27.5845   16.4254   -         -         40.45%
Breastcancer-femalepatient-2016A   -         16.4254   9.4538    -         42.44%
Breastcancer-femalepatient-2016A   -         -         9.4538    6.0452    36.06%

• For dataset-1, the execution time of the M-TBRE-2 algorithm is reduced by 43.11% compared with
the M-TBRE-1 algorithm, and the execution time of the M-TBRE-4 algorithm is reduced by 39.56% compared
with the M-TBRE-2 algorithm, but compared with the M-TBRE-4 algorithm, the execution time of the
M-TBRE-8 algorithm is reduced by only 27.81%. This is because dataset-1 itself is small in scale,
so the time spent on thread context switching and system state transitions accounts for a large
proportion of the total time.

• For dataset-2, the scale is larger, and the total execution time of the algorithm is also larger. The time
spent on thread context switching and system state transitions accounts for a relatively small
proportion of the total time. Therefore, on dataset-2, M-TBRE-8 still reduces the execution time by
36.06% compared with M-TBRE-4.
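
The percentages discussed above follow from the relative reduction (t_before - t_after) / t_before applied to the average execution times. The short sketch below recomputes them from the tabulated values as a check; the class ReductionSketch and its method names are ours and purely illustrative.

```java
import java.util.Locale;

// Illustrative sketch only: recomputes Table 4 entries from Table 2 and Table 4 values.
final class ReductionSketch {

    // Arithmetic mean of one table row.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Relative reduction in execution time, in percent.
    static double reductionPercent(double before, double after) {
        return (before - after) / before * 100.0;
    }

    public static void main(String[] args) {
        // M-TBRE-1 and M-TBRE-2 rows of Table 2 (dataset-1, seconds).
        double[] tbre1 = {0.1832, 0.1869, 0.1516, 0.1807, 0.1654, 0.1560, 0.1535, 0.1700};
        double[] tbre2 = {0.1161, 0.1097, 0.0905, 0.1059, 0.0890, 0.0785, 0.0789, 0.0979};
        System.out.println(String.format(Locale.ROOT, "avg M-TBRE-1: %.4f s", mean(tbre1)));
        System.out.println(String.format(Locale.ROOT, "avg M-TBRE-2: %.4f s", mean(tbre2)));
        // Reduction computed from the rounded averages listed in Table 4.
        System.out.println(String.format(Locale.ROOT, "dataset-1, 1 -> 2 threads: %.2f%%",
                reductionPercent(0.1684, 0.0958)));
        System.out.println(String.format(Locale.ROOT, "dataset-2, 4 -> 8 threads: %.2f%%",
                reductionPercent(9.4538, 6.0452)));
    }
}
```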


For our proposed M-TBRE algorithm, execution becomes faster as the number of threads increases. However,
once the number of threads reaches a certain level, the gain in execution speed diminishes, as shown by
M-TBRE-8 in Figure 2. If the dataset is larger, or if the M-TBRE algorithm submits more subtasks, this
diminishing effect is less pronounced: compared with the M-TBRE-4 algorithm, the M-TBRE-8 algorithm
reduces the execution time by 27.81% on dataset-1, but by 36.06% on dataset-2.


5.2.2 Experiments on Fuzzy Constraints 


This experiment studies how the number of matching subgraphs changes when fuzzy constraints are introduced
into our M-TBRE algorithm. The Fuzzy-M-TBRE algorithm is the M-TBRE algorithm with fuzzy constraints
added. Since both algorithms return all the matched results of a pattern graph, we can compare the total
number of matches before and after the introduction of fuzzy constraints.


In Figure 4 and Figure 5, the abscissa represents different pattern graphs, and the ordinate
represents the number of matched subgraphs. On dataset-1 and dataset-2, for the same pattern graph, the
Fuzzy-M-TBRE algorithm returns more matched results than the M-TBRE algorithm. Each matched subgraph
corresponds to a breast cancer patient. When treating the patient corresponding to the pattern graph, the
treatment plans of the patients corresponding to these matched subgraphs can be used for reference, so
introducing fuzzy constraints yields more treatment options. Therefore, it is worthwhile to introduce
fuzzy constraints into the M-TBRE algorithm.


Figure 4. The number of matched subgraphs of different pattern graphs on Female-breast-cancer-2013a.


Figure 5. The number of matched subgraphs of different pattern graphs on Breastcancer-femalepatient-2016A.
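

To illustrate why a fuzzy age constraint can only enlarge the set of matches, the sketch below compares exact and fuzzy matching on the age attribute alone. It does not reproduce the membership function or the membership constraint value used in the experiments: the triangular function, the spread of 10 years, the threshold of 0.5, the sample ages, and all class and method names are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: the membership function and all parameters are assumed.
final class FuzzyAgeSketch {

    // Assumed triangular membership: 1 at the pattern age, falling linearly
    // to 0 at a distance of `spread` years.
    static double membership(int age, int patternAge, int spread) {
        return Math.max(0.0, 1.0 - Math.abs(age - patternAge) / (double) spread);
    }

    // Exact constraint: the age must equal the pattern age.
    static List<Integer> exactMatches(int[] ages, int patternAge) {
        List<Integer> out = new ArrayList<>();
        for (int age : ages) {
            if (age == patternAge) {
                out.add(age);
            }
        }
        return out;
    }

    // Fuzzy constraint: the membership value must reach the threshold.
    static List<Integer> fuzzyMatches(int[] ages, int patternAge, int spread, double threshold) {
        List<Integer> out = new ArrayList<>();
        for (int age : ages) {
            if (membership(age, patternAge, spread) >= threshold) {
                out.add(age);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] ages = {40, 45, 47, 52, 60}; // made-up ages of data graph vertices
        System.out.println("exact: " + exactMatches(ages, 45));
        System.out.println("fuzzy: " + fuzzyMatches(ages, 45, 10, 0.5));
    }
}
```

Because an exact match has membership 1 under this assumed function, every exact match also passes the fuzzy test for any threshold up to 1, which is consistent with Fuzzy-M-TBRE returning at least as many matched subgraphs as M-TBRE.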


6. CONCLUSION


In this paper, we put forward the problem of GPM in MKGs and provide the related definitions. To solve
this problem, we propose the M-TBRE algorithm, which divides the pattern graph into several pattern
subgraphs, uses multi-threaded bidirectional routing to complete the matching of the pattern subgraphs,
and then merges the matching results. In addition, fuzzy constraints are introduced to obtain more matching
subgraphs. Each matched subgraph corresponds to a past patient whose physical condition matches that of
the patient corresponding to the pattern graph, so the treatment plans of these past patients can be used
for reference in the treatment of that patient. In this way, better and more effective treatment plans
can be developed for the patient corresponding to the pattern graph. We conduct verification experiments
on the M-TBRE algorithm on two public MKG datasets. The experimental results show that, with two or more
threads, our proposed M-TBRE algorithm achieves shorter execution times than the ETOF-K and NTSS
algorithms. Furthermore, the Fuzzy-M-TBRE algorithm returns more matched subgraphs than the M-TBRE
algorithm, which demonstrates the value of introducing fuzzy constraints. In the future, we will further
study and improve the M-TBRE algorithm, and investigate the dynamic graph pattern matching problem in
MKGs, in which the content of the pattern graph changes over time.